SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data

Abstract Survival analysis is a branch of statistics in which the time to an event and the survival (censoring) status are modelled jointly as the dependent response. Current comparisons of survival model performance mostly center on clinical data and classic statistical survival models, with prediction accuracy often serving as the sole metric of model performance. Moreover, survival analysis approaches for censored omics data have not been thoroughly investigated; the common approach is to binarize the survival time and perform a classification analysis. Here, we develop a benchmarking design, SurvBenchmark, that evaluates a diverse collection of survival models on both clinical and omics data sets. SurvBenchmark not only focuses on classical approaches such as the Cox model but also evaluates state-of-the-art machine learning survival models. All approaches were assessed using multiple performance metrics, including model predictability, stability, flexibility, and computational cost. Our systematic comparison design with 320 comparisons (20 methods over 16 data sets) shows that the performance of survival models varies in practice across real-world data sets and across choices of evaluation metric. In particular, we highlight that using multiple performance metrics is critical for a balanced assessment of the various models. The results of our study provide practical guidelines for translational scientists and clinicians, and identify possible areas of investigation in both survival techniques and benchmarking strategies.


Regarding 1. Fair comparison and parameter selection
The altered study design appears much better suited to this end. Thank you very much for the effort, in particular the additional results regarding the two tuning approaches. Although I think a single simple tuning regime would be feasible here, using the default settings is reasonable and very well justified. I agree that this is much closer to what is likely to take place in practice. However, it should be more clearly emphasized that better performance may be achievable if tuning is performed.
Response: Thank you for this comment. We agree that it is important to inform readers that suitable tuning may achieve better performance for a particular method.
We now add the following to line 414: "In this study we did not survey the extensive tuning procedures for these survival approaches, mainly because, in practice, the default hyperparameter settings are often used. Therefore, we also decided to use the default settings in our study. However, we note that applying targeted tuning methods to a particular dataset may lead to different performances for the considered approaches."

Regarding 2. Description
Thanks, all concerns properly addressed. No more comments.
Regarding 3. Reliability
I am aware that Figure 2c provides information to this end. I think additional boxplots that aggregate the methods' performance (e.g. for unoc and bs) over all runs and datasets would provide valuable additional information. For example, from Figure 2c one can tell that the MTLR variants obtain overall higher ranks based on mean prediction performance than the deep learning methods. However, it says nothing about how large the differences in mean performance are.
Response: Thank you for this comment. We agree that boxplots would provide complementary information on both the variation and the mean. In our Figure 2c, the stability ranking (SD_unoc, SD_bs) measures the variation, i.e. the differences in mean performances. We chose to illustrate this information in the main manuscript with the heatmap because there are 16 (datasets) x 20 (accuracy metrics) = 320 boxplots in total, each with 19 boxes representing the 19 survival approaches. That being said, we have now provided boxplots for unoc and bs in the Supplementary.

Kaplan-Meier estimate (KM)
I'm not quite sure I understood the authors' answer correctly. The KM does not use variable information to produce an estimate of the survival function, and I think that is why it would be interesting to include it. This would shed light on how valuable the variables are in the different data sets.
Response: Thank you for clarifying this question. Our study does not focus on comparing feature importance differences between methods, nor does it compare variable contributions across the datasets. Contrasting variable importance rankings among different methods is beyond the scope of this benchmark study, and we would expect such an empirical investigation to show that the findings are data-dependent.
Our understanding of the KM method is that Kaplan-Meier curves describe the whole population (data) or a subpopulation thereof. Therefore, the KM method does not provide predicted risks or survival probabilities for each individual (observation) beyond learning which subpopulation an individual belongs to. KM curves are commonly used to compare two or more subpopulations in the data [Ref 1]. We are aware of one benchmark study [Ref 2] that uses the KM method together with multivariate survival methods. However, when we looked into how the prediction based on KM curves was calculated, we concluded that each observation was assigned the same predicted risk in that implementation, which would contribute little to the benchmark study as a baseline method.
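To make the last point concrete, the following is a minimal sketch in R (not part of the manuscript) showing that a covariate-free Kaplan-Meier fit yields a single survival curve, so every observation would receive the same predicted survival probability; the veteran dataset and the 90-day horizon are chosen purely for illustration.

```r
# Minimal sketch: a Kaplan-Meier fit without covariates gives one curve
# for the whole cohort, so all individuals share the same "prediction".
library(survival)

km <- survfit(Surv(time, status) ~ 1, data = veteran)

# Estimated survival probability at day 90; this single value would be
# assigned to every observation if KM were used as a prediction model.
summary(km, times = 90)$surv
```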
Regarding 4. Scope and clarity
Thanks, all concerns properly addressed. No more comments.

Minor points:
- Since the authors decided to change 'framework' to 'design', note that in Figure 1b
- Please elaborate more on how similarity (reflected in the dendrograms) is defined.
Response: Similarity in the dendrograms is calculated as the Euclidean distance between methods, each represented as a binary vector over the metrics, where 0 means that a metric is not feasible for a method and 1 means that it is. We now add the following to the Figure 2a and b legend: "Similarity is defined using the Euclidean distance with feasibility indicators 0 and 1."
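As an illustration (not from the manuscript), the clustering behind such dendrograms can be reproduced in R roughly as follows; the 0/1 applicability entries below are invented placeholders, not the values used in Figure 2.

```r
# Minimal sketch: cluster methods by a binary metric-applicability matrix
# using Euclidean distance, as described in the response above.
applicability <- rbind(
  CoxPH     = c(IBS = 1, BeggC = 1, UnoC = 1, Brier = 1),
  Lasso_Cox = c(IBS = 0, BeggC = 1, UnoC = 1, Brier = 1),
  RSF       = c(IBS = 1, BeggC = 1, UnoC = 1, Brier = 1),
  DeepSurv  = c(IBS = 0, BeggC = 0, UnoC = 1, Brier = 1)
)

d  <- dist(applicability, method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d)                                  # hierarchical clustering
plot(hc)                                         # dendrogram analogous to Figure 2
```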
- Why is the IBS more similar to Begg's and GH C-index than to the Brier score?
Response: Figure 2b shows whether a metric is applicable (feasible) for a particular method, based on the availability of R packages to calculate the metric in practice. Therefore, Figure 2b does not show that the IBS as a prediction accuracy metric performs similarly to Begg's and the GH C-index. Instead, it says that those three metrics (IBS, Begg's and GH C-index) are not applicable to those methods.
- Why is the IBS not feasible for so many methods, in particular Lasso_Cox, Ridge_Cox, and CoxBoost?
Response: To clarify, our original use of the word "feasible" is, in hindsight, better phrased as "readily applicable". In the revised manuscript, we now refer to how readily the methods can be evaluated by the various metrics, i.e. whether the corresponding R functions can be readily applied to our data to calculate the various evaluation metrics (Supplementary Table 1 provides details of the functions/packages used for each evaluation measure). For example, the IBS calculation (the "crps" function in the R package "pec"; Ref 3) requires a specific type of input that is related to the training model. Currently, the acceptable models must have a "predictSurvProb" method, and this is only generated for the Cox model (with "coxph" class) and the Random Survival Forest model (with "rfsrc" class). We now rephrase "feasibility" in our manuscript as "ready applicability" and modify the Figure 2b legend to: "(b) Ready applicability of the prediction-ability evaluation metrics. Row: methods; Column: evaluation metrics; legend: ready applicability, where red (1) means readily applicable and blue (0) means not readily applicable."
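For concreteness, here is a minimal sketch in R (not part of the manuscript) of how the IBS can be obtained through pec for a model that does have a "predictSurvProb" method; the veteran data and covariates are placeholders.

```r
# Minimal sketch: integrated Brier score via pec::crps() for a Cox model.
# pec() relies on predictSurvProb(), which is why methods without such a
# method are not readily evaluated with the IBS.
library(survival)
library(pec)

fit <- coxph(Surv(time, status) ~ karno + age, data = veteran,
             x = TRUE, y = TRUE)

pe <- pec(list(Cox = fit),
          formula = Surv(time, status) ~ 1,  # censoring model (Kaplan-Meier)
          data = veteran)

crps(pe)  # integrated Brier score (apparent error, no cross-validation)
```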

Among the comparative studies that included real-world datasets in health, we found that most have a specific focus, such as on a certain disease (e.g. colon cancer) or on a certain data platform (e.g. omics data).

The penalised Cox model is another extension of the CoxPH model that helps to prevent overfitting.

The L1 regularized CoxPH model adds a scaled sum of the absolute values of the model coefficients, that is $\lambda_1 \sum_{j=1}^{p} |\beta_j|$, as the regularization term to the partial log-likelihood. Other regularizers can be used, such as L2 regularization, that is $\lambda_2 \sum_{j=1}^{p} \beta_j^2$, or other scaled sums of non-negative penalties of the $\beta_j$'s, as in the following general penalised partial log-likelihood:

$$\ell_{\text{pen}}(\beta) = \ell(\beta) - \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),$$

where $\ell(\beta)$ denotes the Cox partial log-likelihood and $p_{\lambda}(\cdot)$ is a non-negative penalty function.
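As a rough illustration (not part of the manuscript), such penalised Cox models can be fitted in R with the glmnet package, where alpha = 1 corresponds to the L1 (lasso) penalty and alpha = 0 to the L2 (ridge) penalty; the covariates chosen from the veteran data below are arbitrary.

```r
# Minimal sketch: L1- and L2-penalised Cox regression with glmnet.
library(survival)
library(glmnet)

x <- as.matrix(veteran[, c("karno", "diagtime", "age", "prior")])
y <- cbind(time = veteran$time, status = veteran$status)  # required format

lasso_fit <- glmnet(x, y, family = "cox", alpha = 1)  # L1 penalty
ridge_fit <- glmnet(x, y, family = "cox", alpha = 0)  # L2 penalty

# Choose the penalty strength lambda by cross-validated partial-likelihood
# deviance, then inspect the selected coefficients.
cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 1)
coef(cv_fit, s = "lambda.min")
```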

… where the differences come from the different ways that censored subjects are ranked. We will use the following three concordance indices: Begg's C-index, Uno's C-index and the GH C-index.
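As an illustration (not from the manuscript), these three indices can be computed in R, assuming the survAUC package and its BeggC, UnoC and GHCI functions; the train/test split and covariates below are arbitrary.

```r
# Minimal sketch: Begg's, Uno's and Gonen-Heller (GH) C-indices for a Cox
# model, computed on a held-out portion of the veteran data.
library(survival)
library(survAUC)

train <- veteran[1:90, ]
test  <- veteran[91:137, ]

fit <- coxph(Surv(time, status) ~ karno + age, data = train)

lp     <- predict(fit)                  # linear predictor on the training set
lp_new <- predict(fit, newdata = test)  # linear predictor on the test set

surv_train <- Surv(train$time, train$status)
surv_test  <- Surv(test$time, test$status)

BeggC(surv_train, surv_test, lp, lp_new)  # Begg's C-index
UnoC(surv_train, surv_test, lp_new)       # Uno's C-index
GHCI(lp_new)                              # Gonen-Heller C-index
```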

Veteran data is a survival dataset from the randomised trial of two treatment regimens for lung cancer, obtained from the R package "survival". There are 6 measured features in this data.

The first is a protein expression dataset from the iTRAQ platform and the second is a Nanostring dataset from the above melanoma study; the pre-processing steps are described in the respective papers.

The iTRAQ protein expression data has 41 patients with 640 proteins. The Nanostring data has 45 patients with 204 genes [49], and the GEO ID is "GSE156030".

Benchmarking methods: All methods evaluated are described in

(2) According to the specific data modality, select the feasible methods (m in total) and obtain their ranks (recorded in an m by q matrix) for the aspects considered in the "weight vector".

(3) Multiply the rank matrix by the "weight vector" to obtain the final selected method from this list of m scores (Supp Table 3_Supplementary Material).
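As an illustration (not part of the manuscript), the rank-times-weight selection in steps (2)-(3) might look as follows in R; the methods, aspects, ranks and weights are invented placeholders, and we assume rank 1 means best so that the lowest weighted score wins.

```r
# Minimal sketch: combine per-aspect ranks with a user-defined weight vector
# to score and select a method (rank 1 = best, so lower scores are better).
aspects <- c("accuracy", "stability", "speed")

rank_mat <- rbind(                 # m-by-q rank matrix (m methods, q aspects)
  CoxPH    = c(3, 1, 1),
  RSF      = c(1, 2, 3),
  DeepSurv = c(2, 3, 2)
)
colnames(rank_mat) <- aspects

w <- c(accuracy = 0.5, stability = 0.3, speed = 0.2)  # "weight vector"

scores <- rank_mat %*% w                       # one weighted score per method
best   <- rownames(scores)[which.min(scores)]  # final selected method
scores
best
```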

Further examination of the aspects that affect model predictability can be found in Supp Table 2_Supplementary Material.

Deep learning-based methods failed for some datasets on some cross-validation runs. Taking